When people first created character representations, they needed a unique number to identify each character. That’s how ASCII came about: it uses 7 bits to uniquely represent each character. With 7 bits there are at most 2\(^7\) (= 128) distinct combinations, so a maximum of 128 characters can be represented.
One might ask: why 7 bits? Why not a full byte (8 bits)? The last (8th) bit was reserved as a parity bit to detect errors in communication.
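As a minimal sketch in Python (assuming the common even-parity convention, which isn’t specified here), this is how that 8th bit could be computed:

```python
def add_even_parity(code7: int) -> int:
    """Pack a 7-bit ASCII code into a byte whose 8th bit is an even-parity bit."""
    ones = bin(code7).count("1")       # number of 1-bits in the 7-bit code
    parity = ones % 2                  # 1 if that count is odd
    return (parity << 7) | code7       # set the 8th bit so the total is even

code = ord("C")                        # 'C' is 1000011 in 7 bits (three 1-bits)
print(f"{add_even_parity(code):08b}")  # -> 11000011 (four 1-bits: even)
```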
The ASCII character set includes: control characters (codes 0–31 and 127), the digits 0–9, the uppercase and lowercase English letters A–Z and a–z, and common punctuation and symbols.
See below the binary representation of a few example characters in ASCII:
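These can be printed with Python’s built-in ord():

```python
# Print the 7-bit ASCII code of each character.
for ch in "A", "a", "0", " ":
    print(ch, format(ord(ch), "07b"))
# A 1000001
# a 1100001
# 0 0110000
#   0100000
```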
As you may have noticed, ASCII was designed to support only English, since the center of the computer industry was in America at the time. As a consequence, it didn’t need to support accented Latin characters such as á, ü, ç, ñ, etc. (characters with diacritics).
As the need for other Latin characters grew, people started using the 8th bit (instead of reserving it as a parity bit) to encode more characters, for example “á”. Using just one extra bit doubled the size of the original ASCII table, mapping up to 2\(^8\) (= 256) characters instead of 2\(^7\) (= 128) as before.
See below the binary representation of a character from a character set that uses the 8th bit:
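A short Python sketch, using ISO 8859-1 (Latin-1) as one such 8-bit character set:

```python
# 'á' is not in 7-bit ASCII; Latin-1 places it at 0xE1, which needs the 8th bit.
b = "á".encode("latin-1")
print(format(b[0], "08b"))  # -> 11100001
```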
This “ASCII extended to 8 bits instead of 7” is usually just referred to as “extended ASCII” or “8-bit ASCII”.
Note that there are several variations of the 8-bit ASCII table, because different groups extended it for different purposes. One example is ISO 8859-1, also called ISO Latin-1.
ASCII Extended solved the problem for languages based on the Latin alphabet. But what about languages that need completely different-looking characters: Korean, Chinese, Japanese (CJK), Russian, and the like?
To encode and display them properly, we needed an entirely new character set. That’s the rationale behind Unicode. Unicode doesn’t contain every character from every language, but it does contain a gigantic number of characters (see this table, especially Chinese, since each of its characters is distinct).
There is no such thing as “save as Unicode”, because Unicode is an abstract representation of text. You need to “encode” this abstract representation into concrete bytes, and that’s where character encodings come into play.
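In Python terms (a small sketch): a character maps to an abstract code point, and an encoding turns that code point into bytes. The utf-16-be codec is used here so the bytes come out big-endian, without a byte-order mark:

```python
ch = "あ"
print(hex(ord(ch)))            # -> 0x3042  (the abstract code point U+3042)
print(ch.encode("utf-8"))      # -> b'\xe3\x81\x82'  (3 bytes)
print(ch.encode("utf-16-be"))  # -> b'0B'  (the bytes 0x30 0x42, 2 bytes)
```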
UTF-8 and UTF-16 are variable-length encodings. In UTF-8, if a character can be represented with a single byte (because its code point is a very small number), it is encoded with a single byte; if it needs two bytes, two bytes are used, and so on. UTF-16 works on a similar idea but with a minimum unit of 16 bits: it starts with 16 bits and adds another 16 bits if needed. UTF-32, by contrast, is fixed-length: it always uses 4 bytes per character.
Take a look at the following table (and the snippet after it); it should make each encoding clearer.
bits | encoding | characters |
---|---|---|
01000001 | UTF-8 | A |
00000000 01000001 | UTF-16 | A |
00000000 00000000 00000000 01000001 | UTF-32 | A |
11100011 10000001 10000010 | UTF-8 | あ |
00110000 01000010 | UTF-16 | あ |
00000000 00000000 00110000 01000010 | UTF-32 | あ |
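You can verify the table with a few lines of Python (the -be codec variants are used so that no byte-order mark is prepended and the bytes match the big-endian patterns above):

```python
for ch in "A", "あ":
    for enc in "utf-8", "utf-16-be", "utf-32-be":
        bits = " ".join(format(b, "08b") for b in ch.encode(enc))
        print(f"{enc:9} {ch} {bits}")
```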
The ingenious thing about UTF-8 is that it’s binary-compatible with ASCII, the de facto baseline for all encodings: any valid ASCII text is also valid UTF-8, byte for byte. UTF-16 and UTF-32, however, always use their minimum of 2 and 4 bytes (16/32 bits), which makes them incompatible with ASCII, whose characters fit in 7–8 bits.
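A quick Python check of that compatibility:

```python
s = "Hello"
print(s.encode("ascii") == s.encode("utf-8"))   # -> True  (identical bytes)
print(s.encode("ascii") == s.encode("utf-16"))  # -> False (2 bytes per char, plus a BOM)
```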
There are many encodings that are slight variations of the ASCII/Unicode tables above. Any character can be encoded as many different bit sequences, and any particular bit sequence can represent many different characters; it all depends on which encoding is used to read or write them.
See the tables below for a few of the encodings used out there.
bits | encoding | characters |
---|---|---|
11000100 01000010 | Windows Latin 1 | ÄB |
11000100 01000010 | Mac Roman | ƒB |
11000100 01000010 | GB18030 | 腂 |
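You can reproduce this with Python’s codecs (cp1252 corresponds to Windows Latin 1 and mac_roman to Mac Roman):

```python
raw = bytes([0b11000100, 0b01000010])  # the same two bytes each time
for codec in "cp1252", "mac_roman", "gb18030":
    print(codec, raw.decode(codec))
# cp1252 ÄB
# mac_roman ƒB
# gb18030 腂
```

And it works the other way around too: the same characters yield different byte sequences under each encoding, as the next table shows.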
bits | encoding | characters |
---|---|---|
01000110 11111000 11110110 | Windows Latin 1 | Føö |
01000110 10111111 10011010 | Mac Roman | Føö |
01000110 11000011 10111000 11000011 10110110 | UTF-8 | Føö |
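Again, reproducible in a few lines:

```python
for codec in "cp1252", "mac_roman", "utf-8":
    bits = " ".join(format(b, "08b") for b in "Føö".encode(codec))
    print(f"{codec:9} {bits}")
# cp1252    01000110 11111000 11110110
# mac_roman 01000110 10111111 10011010
# utf-8     01000110 11000011 10111000 11000011 10110110
```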
This is why you see garbled text when you open a file in a text editor: the bytes are being interpreted with the wrong encoding. As long as you know which encoding a certain piece of text, that is, a certain byte sequence, is in, the text will be interpreted correctly with that encoding.
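The effect in a short sketch: the same UTF-8 bytes read back with the wrong codec produce mojibake, while the right one recovers the original text:

```python
data = "Føö".encode("utf-8")
print(data.decode("cp1252"))  # -> FÃ¸Ã¶  (garbled: wrong codec)
print(data.decode("utf-8"))   # -> Føö   (correct)
```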